AUTOREGRESSIVE SPECTRAL ESTIMATION, LOG SPECTRAL SMOOTHING, AND ENTROPY*


Emanuel Parzen
Institute of Statistics
Texas A&M University
College Station, TX 77843


Abstract 

Spectral estimation is motivated by information divergence distance. Two
methods of spectral estimation are developed in this paper: autoregressive
spectral estimation (section 3) and log spectral kernel estimation
(section 4). They are motivated as parametric and non-parametric estimators
which minimize "entropy" or "information divergence" distances between raw
and fitted spectral densities. The role of entropy concepts in the
statistical estimation of spectral densities (section 2) is explained by
contrasting it with the role of entropy concepts in probability density
estimation (section 1). Adaptive procedures for forming, and combining,
these estimators for an observed time series are provided by
order-determining and truncation (half-power) point determining criteria,
which are described.

1. The role of entropy concepts in statistical estimation of probability
density functions.

Let $X$ be a continuous random variable, and $X_1, \ldots, X_n$ a random
sample of $X$ (consisting of independent random variables identically
distributed as $X$). The distribution function $F(x)$ and the probability
density function $f(x)$, $-\infty < x < \infty$, are defined by

\[ F(x) = \Pr(X \le x), \qquad f(x) = F'(x). \]

The entropy (or Shannon information) of $X$ is denoted by $H(f)$ and is
defined by

\[ H(f) = \int_{-\infty}^{\infty} \{-\log f(x)\}\, f(x)\, dx = E_f[-\log f(X)]. \]

All observed distributions are assumed to have finite entropy.

A maximum entropy density is a probability density $f(x)$ determined by
maximizing $H(f)$ over all $f$ satisfying certain constraints (usually
involving moments of $f$).

Theorem 1A: Three important densities, and their characterization by a
maximum entropy principle are: (1) the uniform distribution over an interval
$a$ to $b$ maximizes $H(f)$ over the constraint that $f$ is non-zero only on
the interval $a$ to $b$; (2) the exponential distribution with mean $\mu$
maximizes $H(f)$ over the constraint that $f$ is non-zero only for $x > 0$,
and has mean $\mu$; (3) the normal distribution with mean $\mu$ and variance
$\sigma^2$ maximizes $H(f)$ over the constraint that $f$ has mean $\mu$ and
variance $\sigma^2$.
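
As a numerical illustration of part (3) of the theorem, the sketch below compares the entropies of three densities that share mean 0 and variance 1. The grid, the helper name `entropy`, and the choice of the Laplace and uniform densities as competitors are assumptions made for this example only.

```python
import numpy as np

def entropy(pdf, x):
    """Riemann-sum approximation of H(f) = integral of -log f(x) f(x) dx."""
    fx = pdf(x)
    dx = x[1] - x[0]
    # Treat 0 * log 0 as 0 so that densities with bounded support are handled.
    safe = np.where(fx > 0, fx, 1.0)
    return float(np.sum(-fx * np.log(safe)) * dx)

mu, sigma = 0.0, 1.0
x = np.linspace(-30.0, 30.0, 600001)

def normal(t):
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

b = sigma / np.sqrt(2.0)      # Laplace scale giving variance sigma^2
def laplace(t):
    return np.exp(-np.abs(t - mu) / b) / (2 * b)

h = sigma * np.sqrt(3.0)      # uniform half-width giving variance sigma^2
def uniform(t):
    return np.where(np.abs(t - mu) <= h, 1.0 / (2 * h), 0.0)

H_normal = entropy(normal, x)
H_laplace = entropy(laplace, x)
H_uniform = entropy(uniform, x)
print(H_normal, H_laplace, H_uniform)
```

Under these assumptions the normal density should show the largest entropy of the three, in agreement with the closed form $\tfrac12 \log(2\pi e \sigma^2)$.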

The maximum entropy principle is a probability modeling principle in the
foregoing examples. It becomes a statistical estimation principle (which
fits distributions to data) when the constraints on $f$ are expressed in
terms of sample means and variances; it is then similar to the method of
moments. A maximum entropy density estimator $\hat f$ can be expressed in
symbols:

\[ H(\hat f) = \max_f H(f) \]

where $f$ is constrained to have certain moments equal to the corresponding
sample moments.

*This research was supported by the Office of Naval Research (Contract
N00014-81-MP-10001, ARO DAAG29-80-C-0070).

An alternative (and, we believe, more general) statistical estimation
principle is provided by the cross-entropy $H(g;f)$ and information
divergence $I(g;f)$ of two probability density functions $f(x)$ and $g(x)$.
Define

\[ H(g;f) = \int_{-\infty}^{\infty} \{-\log g(x)\}\, f(x)\, dx = E_f[-\log g(X)]; \]

\[ I(g;f) = \int_{-\infty}^{\infty} \Big\{-\log \frac{g(x)}{f(x)}\Big\} f(x)\, dx
          = E_f\Big[-\log \frac{g(X)}{f(X)}\Big]. \]

Note that $H(f) = H(f;f)$. Another name for information divergence is
Kullback-Leibler information number. A minimum information divergence
density $\hat g$ is an approximator to a specified density $f$ determined by

\[ I(\hat g; f) = \min_g I(g;f) \]

where $g$ is constrained to belong to a specified parametric family of
probability densities.

Theorem 1B: Three important examples of minimum information divergence
approximators or estimators are: (1) $f$ is assumed to be positive only over
the interval $a$ to $b$, and $g$ is any uniform distribution; then $\hat g$
is the uniform distribution over $a$ to $b$; (2) $f$ is positive only for
$x \ge a$, and has a finite mean $\mu$, and $g$ is the two parameter
exponential distribution; then $\hat g$ is the exponential distribution with
mean $\mu$ and domain $(a, \infty)$; (3) $f$ has finite mean $\mu$ and
variance $\sigma^2$, and $g$ is any normal distribution; then $\hat g$ is
$N(\mu, \sigma^2)$.

Theorem 1B may have been first explicitly formulated by Thiel (1981),
although it is implicitly known through the equivalence of maximum
likelihood estimation with minimum information divergence estimation.

An important observation by Thiel is that Theorem 1B can be used to prove
Theorem 1A, and thus avoid the use of the calculus of variations. We extend
this observation to spectral density estimation in section 2.
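
Part (3) of Theorem 1B can be checked numerically. The sketch below minimizes the cross-entropy $H(g;f)$ over normal densities $g$ by a plain grid search, for a non-normal $f$; since $I(g;f) = H(g;f) - H(f)$, the same $g$ minimizes the information divergence. The choice of $f$ (a gamma density with mean 2 and variance 2) and the grid resolution are assumptions of the example.

```python
import numpy as np

# True density f: gamma with shape 2 and scale 1, so mean = 2 and variance = 2.
x = np.linspace(0.0, 60.0, 6001)
dx = x[1] - x[0]
f = x * np.exp(-x)

def cross_entropy_normal(m, s):
    """H(g;f) = integral of -log g(x) f(x) dx for g = N(m, s^2) (Riemann sum)."""
    log_g = -0.5 * np.log(2 * np.pi * s ** 2) - 0.5 * ((x - m) / s) ** 2
    return float(np.sum(-log_g * f) * dx)

means = np.linspace(1.0, 3.0, 41)    # grid step 0.05
sds = np.linspace(1.0, 2.0, 101)     # grid step 0.01
H = np.array([[cross_entropy_normal(m, s) for s in sds] for m in means])
i, j = np.unravel_index(np.argmin(H), H.shape)
m_hat, s_hat = means[i], sds[j]
print(m_hat, s_hat)   # expected to sit next to the mean 2 and sd sqrt(2) of f
```

The grid minimizer lands on the mean and standard deviation of $f$ up to the grid step, which is the moment-matching statement of the theorem.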

A minimum information divergence density $\hat g$ can be expressed as a
minimum cross-entropy density:

\[ H(\hat g; f) = \min_g H(g;f). \]

A cross-entropy can be defined for an arbitrary (including discrete)
distribution function $F(x)$ by

\[ H(g;F) = \int_{-\infty}^{\infty} \{-\log g(x)\}\, dF(x). \]

Therefore a minimum information divergence density $\hat g$ can be defined
for an arbitrary distribution function $F$ by

\[ H(\hat g; F) = \min_g H(g;F) \]

where $g$ is constrained to belong to a specified parametric family of
probability densities. Theorem 1B is true for this definition.

Consider now a finite sample $X_1, \ldots, X_n$ and a parametric model
$f_\theta(x)$ for the true probability density $f(x)$, indexed by parameters
$\theta$ which one would like to estimate from the sample. A maximum
likelihood estimator $\hat\theta$ of $\theta$ is defined as the parameter
values $\theta$ maximizing

\[ L_n(\theta) = \frac{1}{n} \log f_\theta(X_1, \ldots, X_n). \]

Let $\tilde F(x)$, $-\infty < x < \infty$, denote the sample distribution
function defined by

\[ \tilde F(x) = \text{fraction of } X_1, \ldots, X_n \le x. \]

One can express

\[ L_n(\theta) = \int_{-\infty}^{\infty} \log f_\theta(x)\, d\tilde F(x)
              = -H(f_\theta; \tilde F). \]

Therefore maximum likelihood parameter estimators $\hat\theta$ yield minimum
information divergence densities $f_{\hat\theta}$.
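
The identity between the average log likelihood and the negative cross-entropy against the sample distribution function can be illustrated as follows. The exponential model $f_\theta(x) = \theta e^{-\theta x}$, the simulated sample, and the grid of candidate values are all hypothetical choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=500)

def avg_log_likelihood(theta, xs):
    """L_n(theta) = (1/n) sum_i log f_theta(X_i) = -H(f_theta; F_tilde)
    for the assumed model f_theta(x) = theta * exp(-theta * x)."""
    return float(np.mean(np.log(theta) - theta * xs))

thetas = np.linspace(0.05, 2.0, 1951)            # grid step 0.001
values = np.array([avg_log_likelihood(t, sample) for t in thetas])
theta_grid = thetas[np.argmax(values)]           # minimum cross-entropy on the grid
theta_closed = 1.0 / sample.mean()               # closed-form maximum likelihood
print(theta_grid, theta_closed)
```

Maximizing $L_n(\theta)$ on the grid and applying the closed-form maximum likelihood formula give the same value up to the grid step.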

By introducing $\tilde f$ to denote the (symbolic) sample probability
density of the sample, one can regard $\hat\theta$ as satisfying

\[ I(f_{\hat\theta}; \tilde f) = \min_\theta I(f_\theta; \tilde f). \]

A re-interpretation of maximum likelihood is obtained by rewriting the
information divergence in terms of quantile functions whose role in
statistical inference is emphasized by Parzen (1979).

Introduce the sample quantile function

\[ \tilde Q(u) = \tilde F^{-1}(u). \]

Its derivative $\tilde q(u) = \tilde Q'(u)$ satisfies

\[ \tilde q(u)\, \tilde f(\tilde Q(u)) = 1. \]

Define

\[ \tilde d_\theta(u) = \frac{d}{du} F_\theta(\tilde Q(u))
                      = f_\theta(\tilde Q(u))\, \tilde q(u) \]

where $F_\theta(x)$ is the distribution function with density $f_\theta(x)$.
Make the change of variable $u = \tilde F(x)$, $x = \tilde Q(u)$ to obtain

\[ I(f_\theta; \tilde f) = \int_0^1 -\log \tilde d_\theta(u)\, du \]

which one can interpret as a measure of how close to a uniform density is
$\tilde d_\theta(u)$. The full consequences of this interpretation are
explored elsewhere.

It should be noted that one can define other measures to minimize to form
parameter estimators: examples are

\[ \int_0^1 \{\tilde d_\theta(u) - 1\}^2\, du, \]

whose minimization leads to "modified chi-square" estimators, and

\[ \int_0^1 \{F_\theta(\tilde Q(u)) - u\}^2\, du, \]

whose minimization leads to "minimum distance estimators".

Information divergence is the measure that most readily generalizes to
stochastic processes.

It should be noted that only the parameter estimation problem is
efficiently solved by minimizing $I(f_\theta; \tilde f)$. The problem of
goodness of fit is solved by considering the size of the difference from the
uniform distribution $D_0(u) = u$ of
$\tilde D_\theta(u) = F_\theta(\tilde Q(u))$ for $\theta = \hat\theta$. The
model identification problem is to find distribution functions $F$ such that
$F(\tilde Q(u))$ is parsimoniously not significantly different from the
uniform distribution $D_0(u) = u$.


2. The role of entropy concepts in statistical estimation of spectral
density functions.

Let $Y(t)$, $t = 1, \ldots, T$ be a sample of a Gaussian zero mean
stationary time series with covariance function

\[ R(v) = E[Y(t)Y(t+v)], \qquad v = 0, \pm 1, \pm 2, \ldots \]

and correlation function

\[ \rho(v) = \frac{R(v)}{R(0)} = \mathrm{Corr}[Y(t), Y(t+v)]. \]

We assume $R(v)$ and $\rho(v)$ are absolutely summable, and define the power
spectrum $S(w)$, $0 \le w \le 1$, and the spectral density $f(w)$,
$0 \le w \le 1$, by

\[ S(w) = \sum_{v=-\infty}^{\infty} e^{-2\pi i w v} R(v), \qquad
   f(w) = \sum_{v=-\infty}^{\infty} e^{-2\pi i w v} \rho(v). \]

The spectral distribution function is defined by

\[ F(w) = \int_0^w f(w')\, dw', \qquad 0 \le w \le 1. \]

When $\rho(v)$ is not assumed to be summable, there always exists a spectral
distribution function $F(w)$, $0 \le w \le 1$, such that

\[ \rho(v) = \int_0^1 e^{2\pi i w v}\, dF(w). \]

When $\rho(v)$ is summable, it has the spectral representation

\[ \rho(v) = \int_0^1 e^{2\pi i w v} f(w)\, dw. \]

A stationary Gaussian time series is called white noise if

\[ \rho(v) = 0, \quad v > 0; \]

\[ f(w) = 1, \quad 0 \le w \le 1; \]

\[ F(w) = w, \quad 0 \le w \le 1. \]

A stationary Gaussian time series with summable correlation function and
integrable log spectral density can be represented in terms of a white noise
time series $\varepsilon(t)$ representing the innovations [prediction errors
$Y^{\nu}(t) = Y(t) - Y^{\mu}(t)$ of the infinite memory one-step ahead
predictor $Y^{\mu}(t)$ of $Y(t)$]. The AR($\infty$), or infinite order
autoregressive representation, is

\[ Y(t) + a_\infty(1)Y(t-1) + \cdots + a_\infty(m)Y(t-m) + \cdots = \varepsilon(t). \]

The MA($\infty$), or infinite order moving average representation, is

\[ Y(t) = \varepsilon(t) + b_\infty(1)\varepsilon(t-1) + b_\infty(2)\varepsilon(t-2) + \cdots \]

A finite parameter representation is an ARMA($p$,$q$) of the form

\[ Y(t) + a_p(1)Y(t-1) + \cdots + a_p(p)Y(t-p)
   = \varepsilon(t) + b_q(1)\varepsilon(t-1) + \cdots + b_q(q)\varepsilon(t-q). \]

The filter relating $Y(t)$ and $\varepsilon(t)$ is called a whitening
filter. Parameter estimation is the theory of estimation of the parameters
of the whitening filter and model identification is the theory of estimation
of the structural form of the whitening filter. To develop approaches to
parameter estimation for a random sample, in section 1 we defined the
following concepts:

Entropy $H(f)$,

Maximum entropy density $\hat f$,

Cross-entropy $H(g;f)$,

Information divergence $I(g;f)$,

Minimum information divergence density,

Minimum cross-entropy density,

Likelihood of a sample,

Maximum likelihood parameter estimator.

To develop approaches to estimation of the parameters $\theta$ of a
parametric model $f_\theta(w)$ of the spectral density $f(w)$ of a
stationary zero mean Gaussian time series $Y(t)$, we develop analogues of
the foregoing concepts. We start with an approximate formula for the
likelihood function

\[ L_T(\theta) = \frac{1}{T} \log f_\theta(Y(1), \ldots, Y(T)) \]

of the time series sample. We assume that $Y(t)$ has been divided by
$\{R(0)\}^{1/2}$ so that it can be considered to have variance 1, and its
covariance function equals its correlation function.


The first step in analyzing a time series should be to compute the sample
correlation function

\[ \hat\rho(v) = \sum_{t=1}^{T-v} Y(t)Y(t+v) \Big/ \sum_{t=1}^{T} Y^2(t) \]

and the sample spectral density

\[ \tilde f(w) = \Big|\sum_{t=1}^{T} Y(t)\exp(-2\pi i w t)\Big|^2 \Big/ \sum_{t=1}^{T} Y^2(t)
             = \sum_{|v|<T} e^{-2\pi i w v}\, \hat\rho(v). \]

It should be emphasized that in practice one should consider using a "data
window" to compute $\tilde f(w)$, for $w = k/Q$, $k = 0, 1, \ldots, Q-1$, by

\[ \tilde f(w) = |\phi(w)|^2 \Big/ \frac{1}{Q}\sum_{k=0}^{Q-1} |\phi(k/Q)|^2, \qquad
   \phi(w) = \sum_{t=1}^{T} Y(t)\, K(t/T)\exp(-2\pi i w t), \]

for a suitable kernel $K(x)$ (properties of windows are discussed in Harris
(1978)). In addition for statistical stability one should then slightly
smooth $\tilde f(w)$: (1) compute the sample correlation function by

\[ \hat\rho(v) = \frac{1}{Q} \sum_{k=0}^{Q-1} \exp\Big(2\pi i \frac{k}{Q} v\Big)\, \tilde f\Big(\frac{k}{Q}\Big), \]

which holds for $0 \le v \le Q-T$ (and therefore one may want to choose
$Q \ge 2T$); (2) compute a slightly smoothed sample spectral density by

\[ \tilde f(w) = \sum_{|v|<T} \exp(-2\pi i w v)\, k(v/M)\, \hat\rho(v) \]

where $M \ge T/2$ and $k(u)$ is a suitable kernel, such as the Parzen lag
window:

\[ k(u) = \begin{cases}
   1 - 6u^2 + 6|u|^3, & |u| \le 0.5, \\
   2(1 - |u|)^3, & 0.5 \le |u| \le 1, \\
   0, & |u| \ge 1.
   \end{cases} \]

Back at the likelihood ranch, one may show that approximately

\[ -L_T(\theta) = \tfrac12 \log 2\pi + H(f_\theta; \tilde f) \]

where

\[ H(f_\theta; \tilde f) = \frac12 \int_0^1 \Big\{\log f_\theta(w) + \frac{\tilde f(w)}{f_\theta(w)}\Big\}\, dw. \]

This formula for likelihood shows that the sample spectral density
$\tilde f(w)$ is a sufficient statistic for a time series. However, it is a
very wiggly function and by itself is not a consistent estimator of $f(w)$.
Estimators $\hat f(w)$ of $f(w)$ can be regarded as "smoothings" of
$\tilde f(w)$, but the basic problem is how much to smooth.

Another aspect of the likelihood formula is its justification as an
approximation. To those misguided analysts for whom maximum likelihood
provides the ultimate estimator for which no expense should be spared, there
is no substitute for the exact likelihood (which of course is exact only if
the model being assumed is exactly true). Information concepts enter
estimation theory when one recognizes that maximum likelihood estimation is
a technical device for carrying out minimum information divergence
estimation. The information divergence for a sample $Y(t)$,
$t = 1, \ldots, T$, is defined in general by

\[ I_T(f_\theta; f) = \frac{1}{T}\, E_f\Big[-\log \frac{f_\theta(Y(1), \ldots, Y(T))}{f(Y(1), \ldots, Y(T))}\Big] \]

where here $f$ denotes the true probability density of the sample, and
$f_\theta$ is a model for $f$. It should be noted that we are using the
notation $f$ and $f_\theta$ with a variety of meanings. For a Gaussian zero
mean stationary time series, the probability density of the sample is
specified by the spectral densities, $f(w)$ of the true distribution and
$f_\theta(w)$ of the model. We continue to denote the information divergence
by $I_T(f_\theta; f)$ but now $f$ indicates a spectral density rather than a
probability density. Pinsker (1963) proves a formula for $I_T(f_\theta; f)$
in the limit as $T \to \infty$:

\[ \lim_{T \to \infty} I_T(f_\theta; f) = I(f_\theta; f) \]

where $I(f_\theta; f)$ is the information divergence defined as follows.

For two spectral densities $f$ and $g$, the information divergence
$I(g;f)$, cross-entropy $H(g;f)$, and entropy $H(f)$ are defined:

\[ I(g;f) = \frac12 \int_0^1 \Big\{\frac{f(w)}{g(w)} - \log \frac{f(w)}{g(w)} - 1\Big\}\, dw
          = H(g;f) - H(f;f), \]

\[ H(g;f) = \frac12 \int_0^1 \Big\{\log g(w) + \frac{f(w)}{g(w)}\Big\}\, dw, \]

\[ H(f) = H(f;f) = \frac12 \int_0^1 \{\log f(w) + 1\}\, dw. \]
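
The sample correlation function, the sample spectral density, and the lag-window smoothing step described in this section can be sketched as follows. The function names, the simulated white-noise series, and the frequency grid are assumptions of this example; the Parzen lag window is written in its standard piecewise-cubic form.

```python
import numpy as np

def sample_correlations(y):
    """rho_hat(v) = sum_t y[t] y[t+v] / sum_t y[t]^2 for v = 0, ..., T-1."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    denom = np.sum(y * y)
    return np.array([np.sum(y[: T - v] * y[v:]) / denom for v in range(T)])

def sample_spectral_density(y, w):
    """|sum_t y(t) exp(-2 pi i w t)|^2 / sum_t y(t)^2 at the frequencies w."""
    y = np.asarray(y, dtype=float)
    t = np.arange(1, len(y) + 1)
    dft = np.exp(-2j * np.pi * np.outer(w, t)) @ y
    return np.abs(dft) ** 2 / np.sum(y * y)

def parzen_window(u):
    """Parzen lag window k(u)."""
    u = np.abs(np.asarray(u, dtype=float))
    inner = 1 - 6 * u ** 2 + 6 * u ** 3
    outer = 2 * np.clip(1 - u, 0.0, None) ** 3
    return np.where(u <= 0.5, inner, outer)

def smoothed_spectral_density(rho, w, M):
    """sum_{|v|<T} exp(-2 pi i w v) k(v/M) rho(v), using rho(-v) = rho(v)."""
    v = np.arange(1, len(rho))
    weights = parzen_window(v / M) * rho[1:]
    return rho[0] + 2 * np.cos(2 * np.pi * np.outer(w, v)) @ weights

y = np.random.default_rng(0).standard_normal(64)
w = np.arange(8) / 8.0
rho = sample_correlations(y)
raw = sample_spectral_density(y, w)
smooth = smoothed_spectral_density(rho, w, M=32)
```

With a very large $M$ the window is effectively 1 at every lag, so the smoothed estimate reproduces the raw sample spectral density; that is the identity between the two expressions for $\tilde f(w)$.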

Since $u - \log u - 1 \ge 0$ for all $u$, $I$ has two of the properties of a
distance: $I(g;f) \ge 0$, $I(f;f) = 0$. However $I$ does not satisfy the
triangle inequality.

The information divergence can be related to the $L_2$ log spectral density
distance

\[ L_2L(f,g) = \int_0^1 \{\log f(w) - \log g(w)\}^2\, dw, \]

using the fact that $u = \exp(\log u) \approx 1 + \log u + \tfrac12 (\log u)^2$.
When $f$ and $g$ are "neighbors" in the sense that their ratio approximates 1,

\[ I(g;f) \approx \tfrac14 L_2L(f,g); \]

then minimizing $I$ is equivalent to minimizing $L_2L$.

An extensive discussion of these distances is given by Gray, Buzo, Gray, and
Matsuyama (1980).
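
The relation between the spectral information divergence $I(g;f) = \tfrac12 \int_0^1 \{f/g - \log(f/g) - 1\}\,dw$ and one quarter of the $L_2$ log spectral distance can be checked numerically for two densities whose ratio is close to 1. The two spectral densities and the midpoint grid below are assumptions chosen for the illustration.

```python
import numpy as np

w = (np.arange(4096) + 0.5) / 4096           # midpoint grid on [0, 1]

def info_divergence(g, f):
    """I(g;f) = 1/2 * integral of {f/g - log(f/g) - 1} dw (midpoint rule)."""
    r = f / g
    return 0.5 * float(np.mean(r - np.log(r) - 1.0))

def l2_log_distance(f, g):
    """L2L(f,g) = integral of {log f - log g}^2 dw (midpoint rule)."""
    return float(np.mean((np.log(f) - np.log(g)) ** 2))

f = np.ones_like(w)                           # white noise spectral density
g = np.exp(0.05 * np.cos(2 * np.pi * w))      # a close "neighbor" of f

I_gf = info_divergence(g, f)
quarter_L2L = 0.25 * l2_log_distance(f, g)
print(I_gf, quarter_L2L)
```

The two printed numbers agree to leading order; the gap between them grows as the ratio $f/g$ moves away from 1.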

The concepts have now been defined to state some of the basic facts of
parameter estimation theory.

Maximum likelihood estimators $\hat\theta$ are equivalent to sample minimum
cross-entropy estimators $\hat\theta$ defined by

\[ H(f_{\hat\theta}; \tilde f) = \min_\theta H(f_\theta; \tilde f). \]

They can be regarded as estimators of the population minimum cross-entropy
"parameters" $\bar\theta$ defined by

\[ H(f_{\bar\theta}; f) = \min_\theta H(f_\theta; f) \]

where $f$ is the true spectral density.

A maximum entropy spectral density $\hat f$ is defined by

\[ H(\hat f) = \max_f H(f) \]

where $f$ is constrained to satisfy a set of constraints of the form

\[ \int_0^1 k_j(w)\, f(w)\, dw = C_j, \qquad j = 1, \ldots, M, \]

for $M$ specified functions $k_j(w)$ and constants $C_j$.

When the constraints are of the form

\[ \int_0^1 e^{2\pi i w j} f(w)\, dw = \rho(j), \qquad j = 0, \pm 1, \ldots, \pm m, \]

it can be shown that $\hat f(w)$ is the autoregressive spectral density
$f_m(w)$ defined as follows:

\[ f_m(w) = \sigma_m^2\, |g_m(e^{2\pi i w})|^{-2}, \]

where

\[ g_m(z) = 1 + a_m(1) z + \cdots + a_m(m) z^m, \]

the autoregressive coefficients $a_m(1), \ldots, a_m(m)$ satisfy normal
equations (called Yule-Walker equations)

\[ \sum_{k=0}^{m} a_m(k)\, \rho(k-j) = 0, \qquad j = 1, \ldots, m, \]

where $a_m(0) = 1$; and

\[ \sigma_m^2 = \sum_{k=0}^{m} a_m(k)\, \rho(k). \]

It should be noted that from a sequence $\rho(v)$, $v = 0, 1, \ldots$, one
can quickly compute $f_m(w)$ for all successive values of $m = 1, 2, \ldots$
using a variety of fast algorithms [see Kailath (1974)]. In practice the
problem is to determine "optimal" values of $m$.
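
One such fast algorithm is the Levinson-Durbin recursion, sketched below for the sign convention used here ($a_m(0) = 1$ and $\sum_k a_m(k)\rho(k-j) = 0$). This is an illustration of the general technique rather than the particular algorithms surveyed in the cited reference; the function name and return layout are assumptions.

```python
import numpy as np

def levinson_durbin(rho, max_order):
    """Solve the Yule-Walker equations for orders 0..max_order.

    Returns (coeffs, variances): coeffs[m] holds a_m(0..m) with a_m(0) = 1,
    and variances[m] holds sigma_m^2 = sum_k a_m(k) rho(k).
    """
    rho = np.asarray(rho, dtype=float)
    a = np.array([1.0])
    sigma2 = rho[0]
    coeffs, variances = [a.copy()], [sigma2]
    for m in range(1, max_order + 1):
        # kappa = -(sum_{k=0}^{m-1} a_{m-1}(k) rho(m-k)) / sigma_{m-1}^2
        kappa = -np.dot(a, rho[m:0:-1]) / sigma2
        a_ext = np.concatenate([a, [0.0]])
        # a_m(k) = a_{m-1}(k) + kappa * a_{m-1}(m-k), so that a_m(m) = kappa
        a = a_ext + kappa * a_ext[::-1]
        sigma2 = sigma2 * (1.0 - kappa ** 2)
        coeffs.append(a.copy())
        variances.append(sigma2)
    return coeffs, variances

# Correlations of a first order autoregression Y(t) - 0.6 Y(t-1) = e(t).
rho_ar1 = 0.6 ** np.arange(5)
coeffs, variances = levinson_durbin(rho_ar1, 3)
print(coeffs[3], variances[3])
```

For the first order example the higher order coefficients come out as zero and the innovation variance stays at $1 - 0.6^2$, which is a quick sanity check on the recursion.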

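Some important properties of $f_m(w)$ are:

(1) $\int_0^1 e^{2\pi i w j} f_m(w)\, dw = \rho(j)$, $j = 0, \pm 1, \ldots, \pm m$;

(2) $\int_0^1 \dfrac{f(w)}{f_m(w)}\, dw = 1$;

(3) $\int_0^1 \log f_m(w)\, dw = \log \sigma_m^2$;

(4) $H(f_m; f) = H(f_m) = \tfrac12 \{\log \sigma_m^2 + 1\}$;

(5) $\sigma_m^2 = \min_{c_1, \ldots, c_m} \int_0^1 |1 + c_1 e^{2\pi i w} + \cdots + c_m e^{2\pi i w m}|^2 f(w)\, dw$;

(6) $g_m(z)$ has all its roots in the complex plane outside the unit circle;

(7) $\lim_{m \to \infty} \log \sigma_m^2 = \log \sigma_\infty^2$, where
$\log \sigma_\infty^2 = \int_0^1 \log f(w)\, dw$;

(8), (9) ... differentiable (the rate of convergence of $f_m(w)$ depends on
the rate of convergence of $\sigma_m^2$ to $\sigma_\infty^2$);

(10) $2 I(f_m; f) = \log \sigma_m^2 - \log \sigma_\infty^2 \to 0$ as $m \to \infty$.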
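
Two of these properties, the correlation matching in (1) and the log integral in (3), lend themselves to a direct numerical check. The sketch below solves the Yule-Walker equations with a dense linear solve and integrates by the midpoint rule; the test correlation sequence (a mixture of two first order autoregressive correlations), the order $m = 3$, and the function names are assumptions of the example.

```python
import numpy as np

def yule_walker(rho, m):
    """Solve sum_{k=0}^m a(k) rho(k-j) = 0 for j = 1..m with a(0) = 1."""
    rho = np.asarray(rho, dtype=float)
    R = np.array([[rho[abs(k - j)] for k in range(1, m + 1)]
                  for j in range(1, m + 1)])
    a_tail = np.linalg.solve(R, -rho[1 : m + 1])
    a = np.concatenate([[1.0], a_tail])
    sigma2 = float(np.dot(a, rho[: m + 1]))   # sigma_m^2 = sum_k a(k) rho(k)
    return a, sigma2

def ar_spectral_density(a, sigma2, w):
    """f_m(w) = sigma_m^2 / |g_m(exp(2 pi i w))|^2."""
    k = np.arange(len(a))
    g = np.exp(2j * np.pi * np.outer(w, k)) @ a
    return sigma2 / np.abs(g) ** 2

v = np.arange(4)
rho = 0.5 * 0.8 ** v + 0.5 * (-0.5) ** v      # a valid correlation sequence
m = 3
a, sigma2 = yule_walker(rho, m)

w = (np.arange(4096) + 0.5) / 4096            # midpoint grid on [0, 1]
f_m = ar_spectral_density(a, sigma2, w)

# Property (1): the first m correlations of f_m reproduce rho(j).
matched = np.array([np.mean(np.cos(2 * np.pi * w * j) * f_m) for j in range(m + 1)])
# Property (3): the integral of log f_m equals log sigma_m^2.
log_integral = float(np.mean(np.log(f_m)))
print(matched, rho, log_integral, np.log(sigma2))
```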

The foregoing facts explain why autoregressive spectral approximations,
introduced in Parzen (1968), (1969), provide powerful, and natural,
estimators of an unknown spectral density. They are generated by the
"maximum entropy approach" introduced by Burg (1967). However the Burg
algorithm does not compute the autoregressive coefficients $a_m(j)$ and the
innovation variances $\sigma_m^2$ by the Yule-Walker equations. Indeed it
does not compute either $\hat\rho(v)$ or $\tilde f(w)$. It does not provide
insight into how to identify "optimal" autoregressive orders $m$.

One approach to defining criteria for an optimal order $m$ is to examine how
well one has transformed to white noise the residual series

\[ \tilde Y_m(t) = Y(t) + a_m(1)Y(t-1) + \cdots + a_m(m)Y(t-m) \]

whose spectral density is given by

\[ \tilde f_m(w) = \frac{|g_m(e^{2\pi i w})|^2 f(w)}{\sigma_m^2} = \frac{f(w)}{f_m(w)}. \]

A "model identification" determined order $m$ has the property that the
spectral distribution function

\[ \tilde F_m(w) = \int_0^w \tilde f_m(w')\, dw', \qquad 0 \le w \le 1, \]

is parsimoniously not significantly different from the uniform distribution
$F_0(w) = w$ representing the spectral distribution function of white noise.

In deriving autoregressive spectral estimators or approximators, we have so
far developed an analog of Theorem 1A, by stating that autoregression
provides maximum entropy estimators subject to the constraint that certain
correlation values are attained. We prefer an analog of Theorem 1B, which
states that: the minimum information divergence parameter estimators of an
autoregressive model for the true spectral density are provided by the
coefficients which satisfy the Yule-Walker equations.

A very important fact (that may not be widely known) is that the maximum
entropy properties of autoregressive spectral densities follow from their
minimum information divergence properties, using the fact that

\[ I(f_m; f) = H(f_m) - H(f) \ge 0; \]

consequently $H(f) \le H(f_m)$. Since $f$ and $f_m$ satisfy the constraint
that their first $m$ correlations equal specified values, the entropy $H(f)$
achieves its maximum value at $f = f_m$.

3. Autoregressive spectral estimators and order determining criteria.

Given a sample $Y(t)$, $t = 1, \ldots, T$, there are many approaches for
forming autoregressive spectral estimators, because [as summarized in Parzen
(1981)] there are four equivalent ways of parametrizing them: (A)
autoregressive coefficients, (B) correlations, (C) partial correlations, and
(D) innovation variances. Here we only consider starting with the sample
correlations $\hat\rho(v)$. Then for $m = 1, 2, \ldots$ one forms

\[ \hat f_m(w) = \hat\sigma_m^2\, |\hat g_m(e^{2\pi i w})|^{-2}, \]

where

\[ \hat g_m(z) = 1 + \hat a_m(1) z + \cdots + \hat a_m(m) z^m; \]

the sample autoregressive coefficients $\hat a_m(j)$ satisfy the sample
Yule-Walker equations

\[ \sum_{k=0}^{m} \hat a_m(k)\, \hat\rho(k-j) = 0, \qquad j = 1, \ldots, m, \]

where $\hat a_m(0) = 1$; and the order $m$ innovation variance

\[ \hat\sigma_m^2 = \sum_{k=0}^{m} \hat a_m(k)\, \hat\rho(k). \]

Define $\tilde\sigma^2$ by

\[ \log \tilde\sigma^2 = \int_0^1 \log \tilde f(w)\, dw. \]

Then as $m$ tends to $T$,

\[ 2 I(\hat f_m; \tilde f) = \log \hat\sigma_m^2 - \log \tilde\sigma^2 \to 0. \]

We desire $\hat f_m$ to be a sequence of consistent estimators of $f$ in the
sense that if one chooses $m$ as a suitable function of $T$, then as
$T \to \infty$

\[ I(\hat f_m; f) \to 0 \quad \text{and} \quad \hat f_m(w) \to f(w), \]

in probability, or with probability one, or in mean square. The first
rigorous proof of such results was given by Berk (1974) who also finds the
asymptotic variance of $\hat f_m(w)$, confirming conjectures in Parzen
(1969).

We now consider the problem of choosing $m$ adaptively from a sample of size
$T$. Conceptually one would like to choose $m$ to minimize
$I(\hat f_m; f)$. One approach to such a procedure is given by Akaike (1974)
and leads to an order determining criterion called AIC.

The Akaike information criterion computes for $m = 1, 2, \ldots$

\[ \mathrm{AIC}(m) = \log \hat\sigma_m^2 + \frac{2m}{T} \]

and an optimal order $\hat m$ satisfying

\[ \mathrm{AIC}(\hat m) = \min_m \mathrm{AIC}(m). \]

The optimal order $\hat m$ can equal 0, indicating that the time series is
white noise. The value of AIC(0) can be adjusted to the value one desires
for the probability of rejecting the hypothesis of white noise, when in fact
the time series is white noise. We
recommend 

AIC(O)  •  -  i  . 
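
A minimal sketch of AIC order selection from the sample correlations follows. It assumes the usual autoregressive form $\mathrm{AIC}(m) = \log \hat\sigma_m^2 + 2m/T$; the value assigned at $m = 0$ is left as a parameter `aic0` whose default of 0 (the log of the unit variance) is an assumption of this example, as are the simulated second order series and the function names.

```python
import numpy as np

def sample_correlations(y, max_lag):
    y = np.asarray(y, dtype=float)
    denom = np.sum(y * y)
    return np.array([np.sum(y[: len(y) - v] * y[v:]) / denom
                     for v in range(max_lag + 1)])

def innovation_variances(rho, max_order):
    """sigma_m^2 for m = 0..max_order via the Levinson-Durbin recursion."""
    a = np.array([1.0])
    s2 = float(rho[0])
    out = [s2]
    for m in range(1, max_order + 1):
        kappa = -np.dot(a, rho[m:0:-1]) / s2
        a_ext = np.concatenate([a, [0.0]])
        a = a_ext + kappa * a_ext[::-1]
        s2 = s2 * (1.0 - kappa ** 2)
        out.append(s2)
    return np.array(out)

def aic_order(y, max_order, aic0=0.0):
    """Return (best order, AIC values); aic0 is a hypothetical AIC(0) setting."""
    T = len(y)
    rho = sample_correlations(y, max_order)
    s2 = innovation_variances(rho, max_order)
    aic = np.log(s2) + 2.0 * np.arange(max_order + 1) / T
    aic[0] = aic0
    return int(np.argmin(aic)), aic

# Simulated zero mean AR(2) series: Y(t) = 0.9 Y(t-1) - 0.5 Y(t-2) + e(t).
rng = np.random.default_rng(1)
T, burn = 2000, 200
e = rng.standard_normal(T + burn)
y = np.zeros(T + burn)
for t in range(2, T + burn):
    y[t] = 0.9 * y[t - 1] - 0.5 * y[t - 2] + e[t]
y = y[burn:]

best_order, aic = aic_order(y, max_order=10)
print(best_order)
```

Raising or lowering `aic0` changes how readily the white noise order $m = 0$ is selected, which is the adjustment described in the text.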

Parzen (1974), (1977) proposes autoregressive order determining criteria,
called CAT, whose foundations are different from those of AIC but which
usually lead to exactly equivalent orders in practice. The time series model
identification problem is to estimate the infinite autoregressive transfer
function

\[ g_\infty(z) = 1 + a_\infty(1) z + \cdots + a_\infty(m) z^m + \cdots \]

by a sample order $m$ autoregressive transfer function $\hat g_m(z)$. To
evaluate the overall mean square error it is convenient to define an
integrated squared error of $\hat g_m(e^{2\pi i w})$, weighted by $f(w)$
over $0 \le w \le 1$, which can be shown to be the sum of a variance term
and a bias term. One can show that the variance term is approximately

\[ \frac{1}{T} \sum_{j=1}^{m} \sigma_j^{-2}. \]

CAT is an estimator of the terms in this mean square error which depend on
$m$; thus one minimizes

\[ \mathrm{CAT}(m) = \frac{1}{T} \sum_{j=1}^{m} \tilde\sigma_j^{-2} - \tilde\sigma_m^{-2}, \]

where

\[ \tilde\sigma_j^2 = \frac{T}{T-j}\, \hat\sigma_j^2 \]

is an "unbiased" estimator of $\sigma_j^2$. At $m = 0$, we assign
$\mathrm{CAT}(0) = -(1 + (1/T))$.

It should be noted that a multiple time series version of CAT is given in
Parzen (1977).

An order determining criterion which is consistent, but whose behavior in
practice is controversial, is given by Hannan and Quinn (1979).


4. Log Spectral Kernel Estimator and Cepstral Correlations

The approach we have been describing for forming "optimal" estimators
$\hat f(w)$ of the spectral density $f(w)$ of a stationary time series is to
view $\hat f(w)$ as a function closest to $\tilde f(w)$ in a distance
between spectral densities given by the information divergence
$I(\hat f; \tilde f)$. The class of functions from which $\hat f(w)$ is
chosen has been constrained (or specified) parametrically, in the sense that
$\hat f(w)$ is of the form $f_{\hat\theta}(w)$, where $\hat\theta$ estimates
the parameters $\theta$ of a model $f_\theta(w)$ for the true $f(w)$.

A non-parametric constraint is to impose a smoothness measure on $\hat f$
such as the integral square of the $r$-th derivative of $\log f(w)$, denoted

\[ \int_0^1 |(\log f(w))^{(r)}|^2\, dw. \]

One then seeks to choose $\hat f$ to maximize smoothness, while minimizing a
measure of distance of $\hat f$ from $\tilde f$. Wahba (1980) introduces the
estimation distance

\[ \int_0^1 |\log \hat f(w) - \log \tilde f(w)|^2\, dw
   + \lambda \int_0^1 |(\log \hat f(w))^{(r)}|^2\, dw \]

where $\lambda$ is a penalty parameter to be determined adaptively by the
data. One may show that the resulting estimators of $g(w) = \log f(w)$ are
of the form, called log spectral kernel estimators,

\[ \hat g(w) = \{\log f(w)\}^{\wedge} = \sum_{v=-\infty}^{\infty} \exp(-2\pi i w v)\, k(v/M)\, \tilde\psi(v) \]

where

\[ \tilde\psi(v) = \int_0^1 \exp(2\pi i w v)\, \log \tilde f(w)\, dw \]

are called cepstral correlations, and the kernel $k(x)$ is given by
(compare Parzen (1958))

\[ k(x) = \frac{1}{1 + x^{2r}}. \]

One often considers only two values for $r$, 2 and 4.

The statistical properties of cepstral correlations have been extensively
investigated by Bhansali (1974).

Since $k(v/M) = \tfrac12$ for $v = M$, we call $M$ the "half power" lag. We
want to adaptively determine $M$ from the sample to minimize the risk
function

\[ R_M = J(\hat f; f) = E\, L_2L(\hat f, f), \]

assuming $\log f(w)$ has a representation

\[ g(w) = \log f(w) = \sum_{v=-\infty}^{\infty} \exp(-2\pi i w v)\, \psi(v). \]

Following Wahba (1980), to minimize $R_M$ one minimizes an estimator of it
of the form

\[ \hat R_M = B(M) + V(M,T) \]

where $B(M)$ and $V(M,T)$ are measures of bias and variance given by

B(«l  -  ir  -  (y(v))^v'*''(l  +  . 

V(M,T)  -  ?  7^  4  r  O’-  <iu  . 

'  “  0 

A closed form evaluation of the integral in $V(M,T)$ can be obtained.
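
A log spectral kernel estimator of this kind can be sketched with the fast Fourier transform: compute cepstral correlations from the log sample spectral density on the grid $w = k/Q$, taper them by $k(v/M)$, and transform back. The kernel is assumed here to be $k(x) = 1/(1 + x^{2r})$, a form that gives $k(1) = \tfrac12$ at the half power lag; the function name, the simulated input, and the choice $M = 5$ are likewise assumptions.

```python
import numpy as np

def log_spectral_kernel_estimate(log_f_tilde, M, r=2):
    """Smooth log f_tilde, sampled at w = k/Q for k = 0..Q-1, in the cepstral domain."""
    log_f_tilde = np.asarray(log_f_tilde, dtype=float)
    Q = len(log_f_tilde)
    # Cepstral correlations: psi(v) ~ (1/Q) sum_k exp(2 pi i k v / Q) log f(k/Q).
    psi = np.fft.ifft(log_f_tilde)
    # Integer lags in FFT order: 0, 1, ..., then the negative lags.
    v = np.fft.fftfreq(Q, d=1.0 / Q)
    kernel = 1.0 / (1.0 + np.abs(v / M) ** (2 * r))   # assumed kernel form
    # g_hat(k/Q) = sum_v exp(-2 pi i k v / Q) k(v/M) psi(v).
    return np.real(np.fft.fft(kernel * psi))

rng = np.random.default_rng(2)
y = rng.standard_normal(64)
periodogram = np.abs(np.fft.fft(y)) ** 2 / np.sum(y * y)
log_raw = np.log(periodogram)
log_smooth = log_spectral_kernel_estimate(log_raw, M=5, r=2)
print(np.var(log_raw), np.var(log_smooth))
```

Because the kernel never exceeds 1 and equals 1 at lag 0, the smoothed log spectrum keeps the mean level of the raw one while its variability across frequency is reduced; a smaller $M$ smooths more.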

5. Iterated spectral estimation.

Observed time series do not usually obey the assumptions made in the
foregoing theory that $Y(t)$ is a zero mean Gaussian time series with
summable correlation function. We call such a time series a "short memory"
time series (of which white noise is a special case, called a "no memory"
time series). Otherwise the time series is called "long memory" (Parzen
(1982)).

Autoregressive spectral estimators are especially suitable for matching the
large scale oscillations of the spectral density of a long memory time
series. The role of the autoregressive filter is then to transform the time
series to a short memory time series [obtained as the residuals
$\tilde Y(t)$ described in section 2]. The spectral density of the short
memory series, which can be regarded as the fine structure of the original
spectral density, can be estimated by a log spectral smoothing estimator as
well as by an autoregressive spectral estimator. Employing two different
approaches to short memory spectral estimation is desirable since the
problem of spectral estimation is not simply a problem of parameter
estimation but is also one of model identification.

Iterated models for forecasting long memory time series are used by Parzen
(1981) under the name of "ARARMA models" (see Appendix for an example).

APPENDIX: Wolfer Sunspot Numbers 1846-1963.

To illustrate the application of some of the foregoing ideas, we report an
iterated autoregressive model fitted to the annual time series $Y(t)$ of
Wolfer's sunspot data for the years 1846-1963 (which is a sample of length
$T = 118$). Our ARARMA model fitting algorithm automatically proposes the
following model (which it hopes will have the best medium range, if not long
range, forecasting capability):

\[ \tilde Y(t) = Y(t) - .482\, Y(t-10) - .554\, Y(t-11), \]

\[ \tilde Y(t) - 1.009\, \tilde Y(t-1) + .362\, \tilde Y(t-2) = \varepsilon(t). \]
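
The two-stage filter can be applied to a series as sketched below. The sketch assumes the second equation reads $\tilde Y(t) - 1.009\,\tilde Y(t-1) + .362\,\tilde Y(t-2) = \varepsilon(t)$ and that the one-step ahead predictor is $Y(t) - \varepsilon(t)$; the function name and the constant test series are hypothetical, and the sunspot data themselves are not reproduced here.

```python
import numpy as np

def ararma_residuals(y):
    """Apply the two filters in turn to a series y (0-based time index).

    Returns (y_tilde, eps); entries whose lags fall before the start of the
    series are left as NaN (the first 11 of y_tilde, the first 13 of eps).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    y_tilde = np.full(n, np.nan)
    # y_tilde(t) = y(t) - .482 y(t-10) - .554 y(t-11), for t >= 11
    y_tilde[11:] = y[11:] - 0.482 * y[1:-10] - 0.554 * y[:-11]
    eps = np.full(n, np.nan)
    # eps(t) = y_tilde(t) - 1.009 y_tilde(t-1) + .362 y_tilde(t-2), for t >= 13
    eps[13:] = y_tilde[13:] - 1.009 * y_tilde[12:-1] + 0.362 * y_tilde[11:-2]
    return y_tilde, eps

y = np.ones(30)                       # placeholder series, not the sunspot data
y_tilde, eps = ararma_residuals(y)
one_step_predictor = y - eps          # defined where eps is not NaN
```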

The series $\tilde Y(t)$ is a short memory time series to which $Y(t)$ has
been transformed by the initial autoregression on $Y(t)$. As an estimator of
the true log spectral density $\log f(w)$ we take, up to a normalizing
constant,

\lo%  c^iw)  f  •  tlog  f^iw)-'  log  f„,(w) 


where 

'  s  ‘ 

is the autoregressive spectral density corresponding to the transformation
from $Y(t)$ to $\tilde Y(t)$.

Figure 1 is a graph of the Wolfer sunspot data (the crosses represent the
one-step ahead predictors of the model above). Figure 2 graphs the iterated
autoregressive log spectral estimator $\{\log f(w)\}^{\wedge}$.